Information loss in an optimal maximum likelihood decoding

The mutual information between a set of stimuli and the elicited neural responses is compared
to the corresponding decoded information. The decoding procedure is presented as an artificial
distortion of the joint probabilities between stimuli and responses. The information loss is
quantified. Whenever the probabilities are only slightly distorted, the information loss is shown
to be quadratic in the distortion.

Understanding the way external stimuli are represented at the neuronal level is one central
challenge in neuroscience. An experimental approach to this end (Optican and Richmond 1987,
Eskandar et al. 1992, Tovee et al. 1993, Kjaer et al. 1994, Heller et al. 1995, Rolls et al. 1996,
Treves et al. 1996, Rolls et al. 1997, Treves 1997, Rolls and Treves 1998, Rolls et al. 1998)
consists in choosing a particular set of stimuli $s \in \mathcal{S}$, which can be controlled by
the experimentalist, and presenting these stimuli to a subject whose neural activity is being
recorded. The set of neural responses $r \in \mathcal{R}$ is then defined as the whole collection
of recorded events. It is up to the researcher to decide which entities in the recorded signal
are considered as events $r$. For example, $r$ can be defined as the firing rate in a fixed time
window, or as the time difference between two consecutive spikes, or as the first $k$ principal
components of the time variation of the recorded potentials in a given interval, and so forth.

Once the stimulus set $\mathcal{S}$ and the response set $\mathcal{R}$ have been settled, the
joint probabilities $P(r, s)$ may be estimated from the experimental data. This is usually done
by measuring the frequency of the joint occurrence of stimulus $s$ and response $r$, for all
$s \in \mathcal{S}$ and $r \in \mathcal{R}$. The mutual information between stimuli and responses
reads (Shannon 1948)

$$I = \sum_{s} \sum_{r} P(r, s) \log_2 \left[ \frac{P(r, s)}{P(r) P(s)} \right], \tag{1}$$

where

$$P(r) = \sum_{s} P(r, s), \tag{2}$$

$$P(s) = \sum_{r} P(r, s). \tag{3}$$

The mutual information quantifies how much can be learned about the identity of the stimulus
shown just by looking at the responses. Accordingly, and since $I$ is symmetrical in $r$ and $s$,
its value is also a measure of the amount of information that the stimuli give about the
responses. From a theoretical point of view, $I$ is the most appealing quantity characterizing
the degree of correlation between stimuli and responses that can be defined. This stems from the
fact that $I$ is the only additive functional of $P(r, s)$ ranging from zero (for uncorrelated
variables) up to the entropy of stimuli or responses (for a deterministic one to one mapping)
(Fano 1961, Cover and Thomas 1991).
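As a minimal sketch of Eq. (1), the mutual information can be evaluated from a joint probability table. The function name `mutual_information` and the two small tables are illustrative assumptions, not part of the original analysis; the tables only reproduce the two limiting cases mentioned above (uncorrelated variables and a deterministic one to one mapping).

```python
import math

def mutual_information(joint):
    """Mutual information in bits of a table joint[r][s] = P(r, s), as in Eq. (1)."""
    p_r = [sum(row) for row in joint]                                   # P(r) = sum_s P(r, s)
    p_s = [sum(row[s] for row in joint) for s in range(len(joint[0]))]  # P(s) = sum_r P(r, s)
    info = 0.0
    for r, row in enumerate(joint):
        for s, p in enumerate(row):
            if p > 0.0:  # terms with P(r, s) = 0 contribute nothing
                info += p * math.log2(p / (p_r[r] * p_s[s]))
    return info

# Hypothetical tables with two responses (rows) and two stimuli (columns).
uncorrelated = [[0.25, 0.25], [0.25, 0.25]]   # P(r, s) = P(r) P(s)
one_to_one = [[0.5, 0.0], [0.0, 0.5]]         # each response identifies one stimulus

info_uncorrelated = mutual_information(uncorrelated)
info_one_to_one = mutual_information(one_to_one)
```

In this toy case the uncorrelated table carries no information, while the one to one table reaches the entropy of the two equiprobable stimuli, namely one bit.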

However, even if formally sound, the mutual information has a severe drawback when dealing with
experimental data. Often, and specifically when analyzing data of multi-unit recordings, the
response set $\mathcal{R}$ is quite large, its size increasing exponentially with the number of
neurons sampled. Therefore, the estimation of $P(r, s)$ from the experimental frequencies may be
far from accurate, especially when recording from the vertebrate cortex, where there are long
time scales in the variability and statistical structure of the responses. The mutual
information $I$, being a nonlinear function of the joint probabilities, is extremely sensitive to
the errors that may be involved in their measured values. As derived in Treves and Panzeri
(1995), Panzeri and Treves (1996) and Golomb et al. (1997), the mean error in calculating $I$
from the frequency table of events $r$ and $s$ is linear in the size of the response set. This
analytical result has been obtained under the assumption that different responses behave
independently. Although there are situations where such a condition does not hold (Victor and
Purpura 1997), it is widely accepted that the bias grows rapidly with the size of the response
set.
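The growth of the sampling bias with the size of the response set can be illustrated numerically. The sketch below is only a toy simulation under stated assumptions (uniform, independent stimuli and responses, so that the true information is zero; 100 trials per experiment; a fixed random seed); it does not reproduce the analytical result of the works cited above, and all function names are hypothetical.

```python
import math
import random

def plug_in_information(counts):
    """Mutual information in bits computed directly from a table counts[r][s] of joint counts."""
    total = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(row[s] for row in counts) for s in range(len(counts[0]))]
    info = 0.0
    for r, row in enumerate(counts):
        for s, c in enumerate(row):
            if c > 0:
                info += (c / total) * math.log2(c * total / (row_tot[r] * col_tot[s]))
    return info

def mean_estimate(n_responses, n_stimuli, n_trials, n_repeats, rng):
    """Average plug-in estimate when r and s are drawn independently (true I = 0)."""
    acc = 0.0
    for _ in range(n_repeats):
        counts = [[0] * n_stimuli for _ in range(n_responses)]
        for _ in range(n_trials):
            counts[rng.randrange(n_responses)][rng.randrange(n_stimuli)] += 1
        acc += plug_in_information(counts)
    return acc / n_repeats

rng = random.Random(0)  # fixed seed, so that the run is repeatable
bias_small = mean_estimate(4, 2, 100, 200, rng)    # small response set
bias_large = mean_estimate(32, 2, 100, 200, rng)   # large response set, same number of trials
```

Since the true information vanishes in this simulation, the averages `bias_small` and `bias_large` are direct estimates of the bias, and the larger response set yields the larger one.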

Therefore, a common practice when dealing with large response sets is to calculate the mutual
information not between $\mathcal{S}$ and $\mathcal{R}$, but between the stimuli and another set
$\mathcal{T}$, each of whose elements $t$ is a function of the true response $r$, that is,
$t = t(r)$ (Treves 1997, Rolls and Treves 1998). It is easy to show that if the mapping between
$r$ and $t$ is one to one, then the mutual information between $\mathcal{S}$ and $\mathcal{R}$ is
the same as the one between $\mathcal{S}$ and $\mathcal{T}$. However, for one to one mappings,
the number of elements in $\mathcal{T}$ is the same as in $\mathcal{R}$. A wiser procedure is to
choose a set $\mathcal{T}$ that is large enough not to lose the relevant information, but
sufficiently small as to avoid significant limited sampling errors. One possibility is to perform
a decoding procedure (Gochin et al. 1994, Rolls et al. 1996, Victor and Purpura 1996, Rolls and
Treves 1998). In this case, $\mathcal{T}$ is taken to coincide with $\mathcal{S}$. To make this
correspondence explicit, the set $\mathcal{T}$ will be denoted by $\mathcal{S}'$ and its elements
$t$ by $s'$. Each $s'$ in $\mathcal{S}'$ is taken to be a function of $r$, and is called the
predicted stimulus of response $r$. As stated in Panzeri et al. (1999), this choice for
$\mathcal{T}$ is the smallest that could potentially preserve the information of the identity of
the stimulus. The data processing theorem (Cover and Thomas 1991) states that since $s'$ is a
function of $r$ alone, and not of the true stimulus $s$ eliciting response $r$, the information
about the real stimulus can only be lost and not created by the transformation $r \to s'$.
Therefore, the true information $I$ is always at least as large as the decoded information
$I_D$, the latter being the mutual information between $\mathcal{S}$ and $\mathcal{S}'$.¹ In
order to have $I$ and $I_D$ as close as possible, it is of course necessary to choose the best
$s'$ for every $r$. The procedure consists in identifying which of the stimuli was most probably
shown, for every elicited response. The conditional probability of having shown stimulus $s$
given that the response was $r$ reads



$$P(s|r) = \frac{P(r, s)}{P(r)}. \tag{4}$$

Therefore, the stimulus that has most likely elicited response $r$ is

$$s'(r) = \arg\max_{s} P(s|r) = \arg\max_{s} P(r, s). \tag{5}$$

By means of Eq. (5), a mapping $r \to s'$ is established: each response has its associated
maximum likelihood stimulus. Equation (4) provides the only definition of $P(s|r)$ that strictly
follows Bayes' rule, so in this case, the decoding is called optimal. There are other
alternative ways of defining $P(s|r)$ (Georgopoulos et al. 1986, Wilson and McNaughton 1993,
Seung and Sompolinsky 1993, Rolls et al. 1996), some of which have the appealing property of
being simple enough to be plausibly carried out by downstream neurons themselves. The purpose of
this letter, however, is to quantify how much information is lost when passing from $r$ to $s'$
using an optimal maximum likelihood decoding procedure.

In general, there are several $r$ associated with a given $s'$. One may therefore partition the
response space $\mathcal{R}$ in separate classes $C(s) = \{ r : s'(r) = s \}$, one class for
every stimulus. The number of responses in class $s'$ is $N_{s'}$. Of course, some classes may be
empty. Here, the assumption is made that each $r$ belongs to one and only one class (that is,
Eq. (5) has a unique solution).

The joint probability of showing stimulus $s$ and decoding stimulus $s'$ reads

$$P(s', s) = \sum_{r \in C(s')} P(r, s), \tag{6}$$

and the overall probability of decoding $s'$,

$$P(s') = \sum_{s} P(s', s) = \sum_{r \in C(s')} P(r). \tag{7}$$

Clearly, with these definitions the decoded information

$$I_D = \sum_{s} \sum_{s'} P(s', s) \log_2 \left[ \frac{P(s', s)}{P(s') P(s)} \right] \tag{8}$$

may be calculated, and has, in fact, been used in several experimental analyses (Rolls et al.
1996, Treves 1997, Rolls and Treves 1998, Panzeri et al. 1999). However, to date, no rigorous
relationship between $I$ and $I_D$ has been established. The derivation of such a relationship
is the main purpose here.

When performing a decoding procedure, $r$ is replaced by $s'$. Such a mapping allows the
calculation of $P(s', s)$, after which any additional structure that may have been present in
$P(r, s)$ is neglected. For example, if two responses $r_1$ and $r_2$ encode the same stimulus
$s'$, it becomes irrelevant whether, for a given $s$, $P(r_1, s)$ is much bigger than
$P(r_2, s)$ or, on the contrary, $P(r_1, s) \approx P(r_2, s)$. The only thing that matters is
the value of the sum of the two: their global contribution to $P(s', s)$. As a consequence, it
seems natural to consider the detailed variation of $P(r, s)$ within each class, when estimating
the information lost in the decoding.

In this spirit, and aiming at quantifying such a loss of information, $P(r, s)$ is written as

$$P(r, s) = \frac{P[s'(r), s]}{N_{s'(r)}} + \Delta(r, s), \tag{9}$$

where $\Delta(r, s) = P(r, s) - P[s'(r), s] / N_{s'(r)}$. Thus, the joint probability $P(r, s)$,
which in principle may have quite a complicated shape in $\mathcal{R}$ space, is separated into
two terms. The first one is flat inside every single class $C(s')$, and the second is whatever
is needed to recover $P(r, s)$. It should be noticed that

$$\sum_{r \in C(s')} \Delta(r, s) = 0 \tag{10}$$

for all $s$. Summing Eq. (9) in $s$,

$$P(r) = \frac{P[s'(r)]}{N_{s'(r)}} + \Delta(r), \tag{11}$$

where

$$\Delta(r) = \sum_{s} \Delta(r, s) \tag{12}$$

and

$$\sum_{r \in C(s')} \Delta(r) = 0. \tag{13}$$
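The decoding steps of Eqs. (5) to (8) can be sketched in a few lines. The joint table `P` below is a hypothetical example with four responses and two stimuli, and the function names are illustrative assumptions; ties in the maximum of Eq. (5) are assumed not to occur, in line with the uniqueness assumption made in the text.

```python
import math

def mutual_information(joint):
    """Mutual information in bits of a table joint[a][b] = P(a, b)."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(row[b] for row in joint) for b in range(len(joint[0]))]
    info = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0.0:
                info += p * math.log2(p / (p_a[a] * p_b[b]))
    return info

def decode(joint):
    """Maximum likelihood stimulus s'(r) = argmax_s P(r, s) for every response r, Eq. (5)."""
    return [max(range(len(row)), key=lambda s: row[s]) for row in joint]

def decoded_joint(joint, s_hat):
    """P(s', s) of Eq. (6): sum of P(r, s) over the responses r in class C(s')."""
    n_s = len(joint[0])
    table = [[0.0] * n_s for _ in range(n_s)]
    for r, row in enumerate(joint):
        for s, p in enumerate(row):
            table[s_hat[r]][s] += p
    return table

# Hypothetical joint probabilities P(r, s): rows are responses, columns are stimuli.
P = [[0.30, 0.05],
     [0.10, 0.05],
     [0.05, 0.15],
     [0.05, 0.25]]

s_hat = decode(P)                  # class membership of every response
P_dec = decoded_joint(P, s_hat)    # indexed as P_dec[s'][s]
I_full = mutual_information(P)     # Eq. (1)
I_dec = mutual_information(P_dec)  # Eq. (8)
```

In this example the first two responses are decoded as the first stimulus and the last two as the second one, and the decoded information `I_dec` comes out smaller than `I_full`, as the data processing theorem requires.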



Replacing Eqs. (9) and (11) in the mutual information (1), one arrives at

$$I - I_D = \sum_{s} \sum_{s'} \sum_{r \in C(s')} P(r, s) \log_2 \left[ \frac{P(r, s)}{Q(r, s)} \right], \tag{14}$$

where

$$Q(r, s) = \frac{P[s'(r), s]}{N_{s'(r)}} + \Delta(r) \frac{P[s'(r), s]}{P[s'(r)]} \tag{15}$$



is a properly defined distribution, since it can be shown to be normalized and non-negative. The
right-hand side of Eq. (14) is the Kullback-Leibler divergence (Kullback 1968) between the
distributions $P$ and $Q$, which is guaranteed to be non-negative. This confirms the intuitive
result $I_D \le I$, the equality being only valid when

$$\Delta(r) P[s'(r), s] = \Delta(r, s) P[s'(r)], \tag{16}$$

for all $r$ and $s$.

Equation (14) states the quantitative difference between the full and the decoded information,
and is the main result of this letter. The amount of lost information is therefore equal to the
informational distance between the original probability distribution $P(r, s)$ and a new function
$Q(r, s)$. It can be easily verified that

$$I_D = \sum_{s} \sum_{r} Q(r, s) \log_2 \left[ \frac{Q(r, s)}{Q(r) Q(s)} \right], \tag{17}$$

where

$$Q(r) = \sum_{s} Q(r, s) = P(r), \qquad Q(s) = \sum_{r} Q(r, s) = P(s). \tag{18}$$



Therefore, the decoded information can be interpreted as a full mutual information between the
stimuli and the responses, but with a distorted probability distribution $Q(r, s)$. In this
context, the difference $I - I_D$ is no more than the distance between the true distribution
$P(r, s)$ and the distorted one $Q(r, s)$.
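This interpretation can be checked numerically. The sketch below builds the distorted distribution in the compact form $Q(r, s) = P(r) P[s'(r), s] / P[s'(r)]$, which follows from combining Eqs. (11) and (15), and compares $I - I_D$ with the Kullback-Leibler divergence between $P$ and $Q$. The joint table and all variable names are hypothetical.

```python
import math

def mutual_information(joint):
    """Mutual information in bits of a table joint[a][b] = P(a, b)."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(row[b] for row in joint) for b in range(len(joint[0]))]
    info = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0.0:
                info += p * math.log2(p / (p_a[a] * p_b[b]))
    return info

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits between two tables of equal shape."""
    return sum(p_rs * math.log2(p_rs / q_rs)
               for p_row, q_row in zip(p, q)
               for p_rs, q_rs in zip(p_row, q_row) if p_rs > 0.0)

# Hypothetical joint probabilities P(r, s): rows are responses, columns are stimuli.
P = [[0.30, 0.05], [0.10, 0.05], [0.05, 0.15], [0.05, 0.25]]
n_s = len(P[0])
s_hat = [max(range(n_s), key=lambda s: row[s]) for row in P]   # Eq. (5)

P_dec = [[0.0] * n_s for _ in range(n_s)]                      # P(s', s), Eq. (6)
for r, row in enumerate(P):
    for s, p in enumerate(row):
        P_dec[s_hat[r]][s] += p
P_sp = [sum(row) for row in P_dec]                             # P(s'), Eq. (7)
P_r = [sum(row) for row in P]                                  # P(r), Eq. (2)

# Distorted distribution: Q(r, s) = P(r) P[s'(r), s] / P[s'(r)].
Q = [[P_r[r] * P_dec[s_hat[r]][s] / P_sp[s_hat[r]] for s in range(n_s)]
     for r in range(len(P))]

loss = mutual_information(P) - mutual_information(P_dec)       # I - I_D
divergence = kl_divergence(P, Q)                               # right-hand side of Eq. (14)
info_Q = mutual_information(Q)                                 # Eq. (17)
```

For this table, `loss` and `divergence` agree to numerical precision, and the mutual information computed with `Q` coincides with the decoded information.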

When is Eq. (16) fulfilled? Surely, if there is at most one response in each class, $\Delta$ is
always zero, and $I = I_D$. Also, if $P(r, s)$ is already flat in each class, there is no
information loss. However, if $P(r, s)$ is not flat inside every class, but obeys the condition
$P(r, s) = P_{s'}(r) P(s', s)$ for a suitable $P(s', s)$ and some function $P_{s'}(r)$ that sums
up to unity within $C(s')$, one can easily show that Eq. (16) holds. Just notice that this case
implies that if $r_1$ and $r_2$ belong to $C(s')$, then $P(r_1, s) / P(r_2, s)$ is independent
of $s$. In other words, within each class $C(s')$, the different functions $P(r|s)$ obtained by
varying $s$ differ from one another by a multiplicative constant. These conditions coincide with
the ones given by Panzeri et al. (1999) for having an exact decoding, within the short time
limit. However, in the present derivation there are no assumptions about the interval in which
responses are measured. Therefore, the decoding being exact whenever Eq. (16) is fulfilled is
not a consequence of the short time limit carried out by Panzeri et al. (1999), but rather, a
general property of the maximum likelihood decoding.
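A small numerical example may help to see the factorized case at work. In the sketch below, a hypothetical table $P(s', s)$ and hypothetical weights $P_{s'}(r)$ (summing to one within each class) are combined into $P(r, s) = P_{s'}(r) P(s', s)$; the table is not flat inside the classes, yet the full and the decoded information coincide.

```python
import math

def mutual_information(joint):
    """Mutual information in bits of a table joint[a][b] = P(a, b)."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(row[b] for row in joint) for b in range(len(joint[0]))]
    info = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0.0:
                info += p * math.log2(p / (p_a[a] * p_b[b]))
    return info

# Hypothetical decoded table P(s', s), indexed as P_dec[s'][s].
P_dec = [[0.4, 0.1],
         [0.1, 0.4]]

# Hypothetical responses, each given as (class s', weight P_{s'}(r)).
# The weights add up to one within each class, but are not uniform.
responses = [(0, 0.75), (0, 0.25), (1, 0.40), (1, 0.60)]

# Factorized joint probabilities P(r, s) = P_{s'}(r) P(s', s).
P = [[weight * P_dec[sp][s] for s in range(2)] for sp, weight in responses]

# Maximum likelihood decoding, Eq. (5): it recovers the intended classes.
s_hat = [max(range(2), key=lambda s: row[s]) for row in P]

I_full = mutual_information(P)
I_dec = mutual_information(P_dec)
```

Here `I_full` and `I_dec` agree to numerical precision, although the two responses of each class have different probabilities.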



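Next, by making a second order Taylor expansion of Eq. (14) in the distortions $\Delta(r, s)$
and $\Delta(r)$ one may show that

$$I - I_D = \frac{1}{2 \ln 2} \sum_{s} \sum_{s'} E(s', s) + O(\Delta^3), \tag{19}$$

where

$$E(s', s) = \frac{P(s', s)}{N_{s'}} \sum_{r \in C(s')} \left[ \frac{\Delta(r, s)}{P(s', s) / N_{s'}} - \frac{\Delta(r)}{P(s') / N_{s'}} \right]^2. \tag{20}$$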



Therefore, in the small $\Delta$ limit, the difference between $I$ and $I_D$ is quadratic in the
distortions $\Delta(r, s)$ and $\Delta(r)$. This means that if in a given situation these
quantities are guaranteed to be small, then the decoded information will be a good estimate of
the full information. Equation (20) is equivalent to

$$E(s', s) = P(s', s) \left\langle \left[ \frac{P(r, s)}{P(s', s) / N_{s'}} - \frac{P(r)}{P(s') / N_{s'}} \right]^2 \right\rangle_{C(s')}, \tag{21}$$

where

$$\langle f(r) \rangle_{C(s')} = \frac{1}{N_{s'}} \sum_{r \in C(s')} f(r).$$

As a consequence, the relevant parameter in determining the size of $E(s', s)$ is given by the
mean value, within $C(s')$, of a function that essentially measures how different the true
probability distributions $P(r, s)$ and $P(r)$ are from their flattened versions
$P(s', s) / N_{s'}$ and $P(s') / N_{s'}$.
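The quadratic behaviour can be checked on a toy example. The sketch below perturbs a table that is flat within each class by a small amount `eps` and compares the exact loss $I - I_D$ with the leading term $\frac{1}{2 \ln 2} \sum_{s} \sum_{s'} E(s', s)$, with $E(s', s)$ evaluated as in Eq. (20). The flat table, the shape of the perturbation and the function names are all hypothetical choices made for illustration.

```python
import math

def mutual_information(joint):
    """Mutual information in bits of a table joint[a][b] = P(a, b)."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(row[b] for row in joint) for b in range(len(joint[0]))]
    info = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0.0:
                info += p * math.log2(p / (p_a[a] * p_b[b]))
    return info

def perturbed_table(eps):
    """Flat-in-class table plus eps times a perturbation that sums to zero in each class."""
    flat = [[0.20, 0.05], [0.20, 0.05], [0.05, 0.20], [0.05, 0.20]]
    shape = [[1.0, -0.5], [-1.0, 0.5], [0.5, -1.0], [-0.5, 1.0]]
    return [[flat[r][s] + eps * shape[r][s] for s in range(2)] for r in range(4)]

def exact_and_quadratic_loss(P):
    """Return the exact I - I_D and its quadratic estimate built from Eq. (20)."""
    n_r, n_s = len(P), len(P[0])
    s_hat = [max(range(n_s), key=lambda s: row[s]) for row in P]   # Eq. (5)
    N = [s_hat.count(sp) for sp in range(n_s)]                     # class sizes N_{s'}
    P_dec = [[0.0] * n_s for _ in range(n_s)]                      # P(s', s), Eq. (6)
    for r in range(n_r):
        for s in range(n_s):
            P_dec[s_hat[r]][s] += P[r][s]
    P_sp = [sum(row) for row in P_dec]                             # P(s'), Eq. (7)
    exact = mutual_information(P) - mutual_information(P_dec)
    total = 0.0                                                    # sum of E(s', s)
    for r in range(n_r):
        sp = s_hat[r]
        flat_r = P_sp[sp] / N[sp]
        delta_r = sum(P[r]) - flat_r                               # Delta(r), Eq. (11)
        for s in range(n_s):
            flat_rs = P_dec[sp][s] / N[sp]
            if flat_rs == 0.0:
                continue                                           # empty entry, no contribution
            delta_rs = P[r][s] - flat_rs                           # Delta(r, s), Eq. (9)
            total += flat_rs * (delta_rs / flat_rs - delta_r / flat_r) ** 2
    return exact, total / (2.0 * math.log(2.0))

exact_1, quad_1 = exact_and_quadratic_loss(perturbed_table(0.001))
exact_2, quad_2 = exact_and_quadratic_loss(perturbed_table(0.002))
```

For these small perturbations the quadratic estimate stays close to the exact loss, and doubling `eps` multiplies the loss by a factor close to four.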

To summarize, this letter presents the maximum likelihood decoding as an artificial, but useful,
distortion of the distribution $P(r, s)$ within each class $C(s')$. The decoded information is
shown to be also a mutual information, the latter calculated with the distorted probability
distribution. The difference between $I$ and $I_D$ is the Kullback-Leibler distance between the
true and distorted distributions. As such, it is always non-negative, and it is easy to identify
the conditions for the equality between the two information measures. Finally, for small
distortions $\Delta$, the amount of lost information is expressed as a quadratic function in
$\Delta$. In short, the aim of the work is to present a formal way of quantifying the effect of
an optimal maximum likelihood decoding.

It should be kept in mind that in real situations, where only a limited amount of data is
available, the estimation of $P(r|s)$ may well involve a careful analysis in itself. Some kind
of assumption (as for example, a Gaussian shaped response variability) is usually required. The
validity of the assumptions made depends on the particular data at hand. An inadequate choice
for $P(r|s)$ may of course lead to a distorted value of $I$, and in fact, the bias may be in
either direction. If the choice of $P(r|s)$ does not even allow the correct identification of
the maximum likelihood stimulus (see Eq. (5)), then the calculated value of $I_D$ will also be
distorted. The purpose of this letter, however, is to quantify how much information is lost when
passing from $r$ to $s'(r)$. No attempt has been made to quantify $I$ or $I_D$ for different
estimations of $P(r|s)$.

Sometimes, $P(s', s)$ is defined in terms of $P(r, s)$ without actually decoding the stimulus to
be associated to each response. For example, $P(s', s)$ can be introduced as
$\sum_{r} P(r, s') P(r, s) / P(r)$ (Treves 1997). This approach, although formally sound, is
not based on a mapping $r \to s'$, and does not allow a partition of $\mathcal{R}$ into classes.
It is therefore not directly related to the analysis presented here. However, there might be
analogous derivations where one may get to quantify the information loss also in this case.

¹ It should be kept in mind, however, that when $I_D$ is calculated from actual recordings, its
value is typically overestimated, because of limited sampling. Therefore, when dealing with real
data sets, one may occasionally obtain a value for $I_D$ that surpasses the true mutual
information $I$. Nevertheless, whenever the number of elements in $\mathcal{S}'$ is
significantly smaller than the number of responses $r$, the sampling bias in $I_D$ will be
bounded by the one obtained in the estimation of $I$.



